1. Introduction. We congratulate the authors on their excellent work.
The paper combines elegant theory and useful practical results in an
intriguing manner. The LAR-Lasso-boosting relationship opens the door for
new insights on existing methods' underlying statistical mechanisms and for
the development of new and promising methodology. Two issues in particular
have captured our attention, as their implications go beyond the squared
error loss case presented in this paper, into wider statistical domains:
robust fitting, classification, machine learning and more. We concentrate
our discussion on these two results and their extensions.






2. Piecewise linear regularized solution paths. The first issue is the
piecewise linear solution paths to regularized optimization problems. As
the discussion paper shows, the path of optimal solutions to the "Lasso"
regularized optimization problem

(2.1) $\hat{\beta}(\lambda) = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1$

is piecewise linear as a function of $\lambda$; that is, there exist
$\infty > \lambda_0 > \lambda_1 > \cdots > \lambda_m = 0$ such that for all
$\lambda \geq 0$, with $\lambda_k \geq \lambda \geq \lambda_{k+1}$, we have

$\hat{\beta}(\lambda) = \hat{\beta}(\lambda_k) - (\lambda - \lambda_k)\gamma_k.$

In the discussion paper's terms, $\gamma_k$ is the "LAR" direction for the
$k$th step of the LAR-Lasso algorithm.

This property allows the LAR-Lasso algorithm to generate the whole
path of Lasso solutions, $\hat{\beta}(\lambda)$, for "practically" the cost
of one least squares calculation on the data (this is exactly the case for
LAR but not for LAR-Lasso, which may be significantly more computationally
intensive on some data sets). The important practical consequence is that
it is not necessary to select the regularization parameter $\lambda$ in
advance, and it is now computationally feasible to optimize it based on
cross-validation (or approximate $C_p$, as presented in the discussion
paper).
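
The piecewise linearity of the Lasso path can be checked numerically in one special case where the solution has a closed form. The sketch below is an illustration under an added assumption, not the LAR-Lasso algorithm: it assumes an orthonormal design ($X^T X = I$), in which case the minimizer of (2.1) is coordinatewise soft thresholding of $z = X^T y$ at $\lambda/2$. The function names `lasso_orthonormal` and `max_linearity_gap` and the simulated data are hypothetical.

```python
import numpy as np

def lasso_orthonormal(z, lam):
    """Minimizer of ||y - X b||_2^2 + lam * ||b||_1 when X.T @ X = I, z = X.T @ y."""
    return np.sign(z) * np.maximum(np.abs(z) - lam / 2.0, 0.0)

rng = np.random.default_rng(0)
# Assumption: orthonormal design, built from the Q factor of a QR factorization.
X, _ = np.linalg.qr(rng.standard_normal((50, 4)))
y = X @ np.array([3.0, -2.0, 0.0, 1.0]) + 0.1 * rng.standard_normal(50)
z = X.T @ y

# In this special case the breakpoints are lam = 2|z_j| (sorted), followed by 0.
knots = np.append(np.sort(2.0 * np.abs(z))[::-1], 0.0)

def max_linearity_gap(knots, z):
    """Largest deviation of the path midpoint from the chord on each segment."""
    gap = 0.0
    for hi, lo in zip(knots[:-1], knots[1:]):
        mid = 0.5 * (hi + lo)
        chord = 0.5 * (lasso_orthonormal(z, hi) + lasso_orthonormal(z, lo))
        gap = max(gap, np.max(np.abs(lasso_orthonormal(z, mid) - chord)))
    return gap

gap = max_linearity_gap(knots, z)  # zero up to rounding if the path is linear between knots
```

For a general design the breakpoints are not available in closed form, which is where a path-following algorithm such as LAR-Lasso is needed.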

The question we ask is: what makes the solution paths piecewise linear? 
Is it the use of squared error loss? Or the Lasso penalty? The answer is that 
both play an important role. However, the family of (loss, penalty) pairs 
which facilitates piecewise linear solution paths turns out to contain many 
other interesting and useful optimization problems. 

We now briefly review our results, presented in detail in Rosset and Zhu
(2004). Consider the general regularized optimization problem

(2.2) $\hat{\beta}(\lambda) = \arg\min_{\beta} \sum_i L(y_i, x_i^T\beta) + \lambda J(\beta)$,

where we only assume that the loss $L$ and the penalty $J$ are both convex
functions of $\beta$ for any sample $\{x_i^T, y_i\}_{i=1}^n$. For our
discussion, the data sample is assumed fixed, and so we will use the
notation $L(\beta)$, where the dependence on the data is implicit.

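Notice that piecewise linearity is equivalent to requiring that

$\frac{\partial \hat{\beta}(\lambda)}{\partial \lambda}$

is piecewise constant as a function of $\lambda$. If $L$, $J$ are twice
differentiable functions of $\beta$, then it is easy to derive that

(2.3) $\frac{\partial \hat{\beta}(\lambda)}{\partial \lambda} = -\left(\nabla^2 L(\hat{\beta}(\lambda)) + \lambda \nabla^2 J(\hat{\beta}(\lambda))\right)^{-1} \nabla J(\hat{\beta}(\lambda)).$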

With a little more work we extend this result to "almost twice differentiable"
loss and penalty functions (i.e., twice differentiable everywhere except at a
finite number of points), which leads us to the following sufficient conditions
for piecewise linear $\hat{\beta}(\lambda)$:

1. $\nabla^2 L(\hat{\beta}(\lambda))$ is piecewise constant as a function of
$\lambda$. This condition is met if $L$ is a piecewise-quadratic function of
$\beta$. This class includes the squared error loss of the Lasso, but also
absolute loss and combinations of the two, such as Huber's loss.

2. $\nabla J(\hat{\beta}(\lambda))$ is piecewise constant as a function of
$\lambda$. This condition is met if $J$ is a piecewise-linear function of
$\beta$. This class includes the $l_1$ penalty of the Lasso, but also the
$l_\infty$ norm penalty.
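
Formula (2.3) can be sanity-checked numerically in a fully smooth case. The sketch below is an illustration under assumptions we introduce for this purpose: squared error loss with the ridge penalty $J(\beta) = \|\beta\|_2^2$, for which $\nabla^2 L = 2X^T X$, $\nabla^2 J = 2I$ and $\nabla J = 2\beta$. The function names and the simulated data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = rng.standard_normal(30)
p = X.shape[1]

def beta_hat(lam):
    """Closed-form minimizer of ||y - X b||^2 + lam * ||b||^2 (ridge penalty)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def path_derivative(lam):
    """Right-hand side of (2.3) for squared error loss and the ridge penalty."""
    b = beta_hat(lam)
    hessian = 2.0 * (X.T @ X) + lam * 2.0 * np.eye(p)  # hess L + lam * hess J
    return -np.linalg.solve(hessian, 2.0 * b)          # grad J = 2 b

lam, h = 0.7, 1e-5
# Central finite difference of the solution path with respect to lambda.
finite_diff = (beta_hat(lam + h) - beta_hat(lam - h)) / (2.0 * h)
max_err = np.max(np.abs(finite_diff - path_derivative(lam)))
```

In this example $\nabla J(\hat{\beta}(\lambda)) = 2\hat{\beta}(\lambda)$ changes continuously with $\lambda$, so condition 2 fails and the ridge path is smooth but not piecewise linear.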

2.1. Examples. Our first example is the "Huberized" Lasso; that is, we
use the loss

(2.4) $L(y, x^T\beta) = \begin{cases} (y - x^T\beta)^2, & \text{if } |y - x^T\beta| \leq \delta, \\ \delta^2 + 2\delta(|y - x^T\beta| - \delta), & \text{otherwise,} \end{cases}$

with the Lasso penalty. This loss is more robust than squared error loss
against outliers and long-tailed residual distributions, while still allowing us
to calculate the whole path of regularized solutions efficiently.

Fig. 1. Coefficient paths for Huberized Lasso (left) and Lasso (right) for data example:
$\hat{\beta}_1(\lambda)$ is the full line; the true model is $E(Y|x) = 10x_1$.
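
The loss (2.4) is simple to write down in code. The sketch below is an illustration only: it evaluates the loss and its derivative as functions of the residual $r = y - x^T\beta$, and the names `huber_loss` and `huber_grad` are hypothetical. It makes visible why the loss is robust: the derivative is capped at $\pm 2\delta$, so a large residual cannot exert unbounded pull on the fit.

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Loss (2.4) as a function of the residual r = y - x'beta."""
    a = np.abs(r)
    return np.where(a <= delta, r ** 2, delta ** 2 + 2.0 * delta * (a - delta))

def huber_grad(r, delta=1.0):
    """Derivative of the loss with respect to r, capped at +/- 2 * delta."""
    return 2.0 * np.clip(r, -delta, delta)

r = np.array([-10.0, -1.0, -0.5, 0.0, 0.5, 1.0, 10.0])
losses = huber_loss(r)  # quadratic inside [-delta, delta], linear outside
grads = huber_grad(r)
```

The loss is quadratic on one piece and linear on the others, so its second derivative is piecewise constant, in line with condition 1.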

To illustrate the importance of robustness in addition to regularization,
consider the following simple simulated example: take $n = 100$ observations
and $p = 80$ predictors, where all $x_{ij}$ are i.i.d. $N(0, 1)$ and the true
model is

(2.5) $y_i = 10 \cdot x_{i1} + \varepsilon_i,$

(2.6) $\varepsilon_i \overset{\text{i.i.d.}}{\sim} 0.9 \cdot N(0, 1) + 0.1 \cdot N(0, 100).$

So the normality of residuals, implicitly assumed by using squared error loss,
is violated.

Figure 1 shows the optimal coefficient paths $\hat{\beta}(\lambda)$ for the
Lasso (right) and "Huberized" Lasso, using $\delta = 1$ (left). We observe that
the Lasso fails in identifying the correct model $E(Y|x) = 10x_1$ while the
robust loss identifies it almost exactly, if we choose the appropriate
regularization parameter.
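
For readers who wish to experiment with this setting, data of the form (2.5)-(2.6) can be generated as follows. This is a sketch under stated assumptions, not the code used for Figure 1: we read (2.6) as a two-component mixture (each error is drawn from $N(0, 1)$ with probability 0.9 and from $N(0, 100)$ with probability 0.1), and we read the second argument of $N(\cdot, \cdot)$ as a variance, so the contaminating component has standard deviation 10. The function name `simulate` and the seed are hypothetical.

```python
import numpy as np

def simulate(n=100, p=80, contamination=0.1, seed=0):
    """Draw (X, y) from model (2.5) with mixture errors as in (2.6)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))          # x_ij i.i.d. N(0, 1)
    outlier = rng.random(n) < contamination  # which errors come from N(0, 100)
    sd = np.where(outlier, 10.0, 1.0)        # assumption: 100 is a variance
    eps = sd * rng.standard_normal(n)
    y = 10.0 * X[:, 0] + eps                 # only the first predictor matters
    return X, y, eps, outlier

X, y, eps, outlier = simulate()
```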




As a second example, consider a classification scenario where the loss we
use depends on the margin $y x^T\beta$ rather than on the residual. In
particular, consider the 1-norm support vector machine regularized
optimization problem, popular in the machine learning community. It consists
of minimizing the "hinge loss" with a Lasso penalty:

$\hat{\beta}(\lambda) = \arg\min_{\beta} \sum_i (1 - y_i x_i^T\beta)_+ + \lambda \|\beta\|_1.$

This problem obeys our conditions for piecewise linearity, and so we can
generate all regularized solutions for this fitting problem efficiently. This
is particularly advantageous in high-dimensional machine learning problems,
where regularization is critical, and it is usually not clear in advance
what a good regularization parameter would be. A detailed discussion of
the computational and methodological aspects of this example appears in
Zhu, Rosset, Hastie, and Tibshirani (2004).
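
As a small illustration of the criterion being minimized, the sketch below evaluates the hinge loss on the margins $y_i x_i^T\beta$ plus a Lasso penalty for a given $\beta$. It only evaluates the objective and does not trace the solution path. The function name `hinge_l1_objective` and the toy data are hypothetical, and labels are assumed to be coded as $\pm 1$.

```python
import numpy as np

def hinge_l1_objective(beta, X, y, lam):
    """Sum of hinge losses on the margins y_i * x_i'beta plus lam * ||beta||_1."""
    margins = y * (X @ beta)
    hinge = np.maximum(0.0, 1.0 - margins)  # zero once the margin reaches 1
    return np.sum(hinge) + lam * np.sum(np.abs(beta))

# Toy data: three observations, two predictors, labels coded as +/- 1.
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
y = np.array([1.0, -1.0, -1.0])
value = hinge_l1_objective(np.array([2.0, 0.5]), X, y, lam=0.1)
```

Both terms are piecewise linear in $\beta$, so the second derivative of the loss is zero wherever it exists and the gradient of the penalty is piecewise constant, in line with the two sufficient conditions of Section 2.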

3. Relationship between "boosting" algorithms and $l_1$-regularized
fitting. The discussion paper establishes the close relationship between
$\varepsilon$-stagewise linear regression and the Lasso. Figure 1 in that
paper illustrates the near-equivalence in the solution paths generated by the
two methods, and Theorem 2 formally states a related result. It should be
noted, however, that their theorem falls short of proving the global relation
between the methods, which the examples suggest.

In Rosset, Zhu and Hastie (2003) we demonstrate that this relation between
the path of $l_1$-regularized optimal solutions [which we have denoted
above by $\hat{\beta}(\lambda)$] and the path of "generalized"
$\varepsilon$-stagewise (AKA boosting) solutions extends beyond squared error
loss and in fact applies to any convex differentiable loss.

More concretely, consider the following generic gradient-based
"$\varepsilon$-boosting" algorithm [we follow Friedman (2001) and Mason,
Baxter, Bartlett and Frean (2000) in this view of boosting], which
iteratively builds the solution path $\beta^{(t)}$:

Algorithm 1 (Generic gradient-based boosting algorithm).

1. Set $\beta^{(0)} = 0$.
2. For $t = 1 : T$,

(a) Let $j_t = \arg\max_j \left| \frac{\partial \sum_i L(y_i, x_i^T\beta^{(t-1)})}{\partial \beta_j^{(t-1)}} \right|$.

(b) Set $\beta_{j_t}^{(t)} = \beta_{j_t}^{(t-1)} - \varepsilon \cdot \mathrm{sign}\left( \frac{\partial \sum_i L(y_i, x_i^T\beta^{(t-1)})}{\partial \beta_{j_t}^{(t-1)}} \right)$ and $\beta_k^{(t)} = \beta_k^{(t-1)}$, $k \neq j_t$.

Fig. 2. Exact coefficient paths (left) for $l_1$-constrained logistic regression and boosting
coefficient paths (right) with binomial log-likelihood loss on five variables from the "spam"
dataset. The boosting path was generated using $\varepsilon = 0.003$ and 7000 iterations.



This is a coordinate descent algorithm, which reduces to forward stagewise,
as defined in the discussion paper, if we take the loss to be squared error loss:
$L(y_i, x_i^T\beta^{(t-1)}) = (y_i - x_i^T\beta^{(t-1)})^2$. If we take the
loss to be the exponential loss,

$L(y_i, x_i^T\beta^{(t-1)}) = \exp(-y_i x_i^T\beta^{(t-1)}),$

we get a variant of AdaBoost [Freund and Schapire (1997)], the original
and most famous boosting algorithm.
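
Algorithm 1 translates almost line by line into code. The sketch below is a minimal illustration, not the implementation behind Figure 2: the function names, the simulated data and the step size are hypothetical, and the loss enters only through a user-supplied gradient of $\sum_i L(y_i, x_i^T\beta)$ with respect to $\beta$. The two gradients shown correspond to the squared error and exponential losses above, the latter assuming labels coded as $\pm 1$.

```python
import numpy as np

def squared_error_grad(beta, X, y):
    """Gradient of sum_i (y_i - x_i'beta)^2 with respect to beta."""
    return -2.0 * (X.T @ (y - X @ beta))

def exponential_grad(beta, X, y):
    """Gradient of sum_i exp(-y_i x_i'beta); labels y_i assumed in {-1, +1}."""
    w = np.exp(-y * (X @ beta))
    return -(X.T @ (w * y))

def eps_boost(X, y, loss_grad, eps=0.01, n_steps=1000):
    """Algorithm 1: returns the (n_steps + 1) x p array of iterates beta^(t)."""
    beta = np.zeros(X.shape[1])           # step 1: beta^(0) = 0
    path = [beta.copy()]
    for _ in range(n_steps):              # step 2: t = 1, ..., T
        g = loss_grad(beta, X, y)
        j = np.argmax(np.abs(g))          # (a) coordinate with the largest |gradient|
        beta[j] -= eps * np.sign(g[j])    # (b) move it by eps; others stay fixed
        path.append(beta.copy())
    return np.array(path)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(200)
path = eps_boost(X, y, squared_error_grad, eps=0.01, n_steps=50)
```

Each iteration changes exactly one coefficient by $\varepsilon$, so after $t$ steps the iterate has $l_1$ norm at most $t\varepsilon$; this is the sense in which the number of steps plays the role of a regularization parameter.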

Figure 2 illustrates the equivalence between Algorithm 1 and the optimal 
solution path for a simple logistic regression example, using five variables 
from the "spam" dataset. We can see that there is a perfect equivalence 
between the regularized solution path (left) and the "boosting" solution 
path (right). 

In Rosset, Zhu and Hastie (2003) we formally state this equivalence, with
the required conditions, as a conjecture. We also generalize the weaker result,
proven by the discussion paper for the case of squared error loss, to any
convex differentiable loss.

This result is interesting in the boosting context because it facilitates a 
view of boosting as approximate and implicit regularized optimization. The 
situations in which boosting is employed in practice are ones where explicitly 
solving regularized optimization problems is not practical (usually very high- 
dimensional predictor spaces). The approximate regularized optimization 
view which emerges from our results allows us to better understand boosting 
and its great empirical success [Breiman (1999)]. It also allows us to derive 
approximate convergence results for boosting. 

4. Conclusion. The computational and theoretical results of the discus- 
sion paper shed new light on variable selection and regularization methods 
for linear regression. However, we believe that variants of these results are 
useful and applicable beyond that domain. We hope that the two extensions 
that we have presented convey this message successfully. 

Acknowledgment. We thank Giles Hooker for useful comments.